Stable Diffusion
Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt. Stable Diffusion is a latent diffusion model, a kind of deep generative neural network developed by the CompVis group at LMU Munich. The model has been released by a collaboration of Stability AI, CompVis LMU, and Runway, with support from EleutherAI and LAION. In October 2022, Stability AI raised US$101 million in a round led by Lightspeed Venture Partners and Coatue Management.

Stable Diffusion's code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E and Midjourney, which were accessible only via cloud services.


Technology


Architecture

Stable Diffusion uses a kind of diffusion model (DM), called a latent diffusion model (LDM). Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise on training images, which can be thought of as a sequence of denoising autoencoders. Stable Diffusion consists of three parts: a variational autoencoder (VAE), a U-Net, and an optional text encoder. The VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting the representation back into pixel space. The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts to an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.
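The data flow described above can be illustrated with a short, heavily simplified PyTorch sketch. The modules below are small stand-ins rather than the real components (Stable Diffusion uses CLIP ViT-L/14, a large cross-attention U-Net, and a KL-regularized VAE), and the scheduler update is reduced to a single line, so the output is noise rather than a coherent image; the sketch only shows how text conditioning, latent-space denoising, and VAE decoding fit together.

    # Conceptual sketch of the latent diffusion data flow; not the CompVis implementation.
    import torch
    import torch.nn as nn

    LATENT_CHANNELS, LATENT_SIZE, EMBED_DIM = 4, 64, 768   # 512x512 pixels map to 64x64 latents

    text_encoder = nn.Embedding(49408, EMBED_DIM)                       # stand-in for the CLIP text encoder
    unet = nn.Conv2d(LATENT_CHANNELS, LATENT_CHANNELS, 3, padding=1)    # stand-in for the U-Net denoiser
    vae_decoder = nn.ConvTranspose2d(LATENT_CHANNELS, 3, 8, stride=8)   # stand-in for the VAE decoder

    token_ids = torch.randint(0, 49408, (1, 77))          # a tokenized prompt (77 CLIP tokens)
    cond = text_encoder(token_ids)                         # prompt embedding; the real U-Net consumes
                                                           # this via cross-attention (the stand-in ignores it)
    latents = torch.randn(1, LATENT_CHANNELS, LATENT_SIZE, LATENT_SIZE)  # start from pure Gaussian noise

    for t in reversed(range(50)):                          # iterative denoising in latent space
        noise_pred = unet(latents)                         # real U-Net also receives the timestep t and cond
        latents = latents - 0.02 * noise_pred              # scheduler step, heavily simplified

    image = vae_decoder(latents)                           # decode latents back to pixel space
    print(image.shape)                                     # torch.Size([1, 3, 512, 512])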


Training data

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, in which 5 billion image-text pairs were classified based on language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (e.g. subjective visual quality). The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of the model's training data identified that, out of a smaller subset of 12 million images taken from the original wider dataset used, approximately 47% of the sample size of images came from 100 different domains, with Pinterest taking up 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt, and Wikimedia Commons.


Training procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked to rate how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. Final rounds of training additionally dropped 10% of text conditioning to improve classifier-free diffusion guidance. The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.
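The aesthetic and watermark filters, together with the 10% text-conditioning dropout used for classifier-free guidance, can be sketched roughly as follows. The field names (aesthetic_score, pwatermark, width, height) are illustrative assumptions, not the exact LAION-5B metadata schema.

    import random

    # Illustrative filtering of LAION-style image-caption records.
    def keep_for_aesthetics_subset(record, min_side=512):
        return (record["aesthetic_score"] >= 5.0            # LAION-Aesthetics Predictor V2 >= 5/10
                and record["pwatermark"] < 0.8               # predicted watermark probability below 80%
                and min(record["width"], record["height"]) >= min_side)

    def maybe_drop_caption(caption, drop_prob=0.10):
        # Randomly replace 10% of captions with an empty string so the model also learns an
        # unconditional distribution, which is what classifier-free guidance relies on at sampling time.
        return "" if random.random() < drop_prob else caption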


Limitations

Stable Diffusion has issues with degradation and inaccuracies in certain scenarios. Initial releases of the model were trained on a dataset consisting of 512×512 resolution images, meaning that the quality of generated images noticeably degrades when user specifications deviate from its "expected" 512×512 resolution; the version 2.0 update of the Stable Diffusion model later introduced the ability to natively generate images at 768×768 resolution. Another challenge is generating human limbs, due to poor data quality of limbs in the LAION database. The model is insufficiently trained to understand human limbs and faces due to the lack of representative features in the database, and prompting the model to generate images of this type can confound it.

Accessibility for individual developers can also be a problem. In order to customize the model for new use cases that are not included in the dataset, such as generating anime characters ("waifu diffusion"), new data and further training are required. Fine-tuned adaptations of Stable Diffusion created through additional retraining have been used for a variety of different use-cases, from medical imaging to algorithmically generated music. However, this fine-tuning process is sensitive to the quality of the new data; images at low resolution or at a different resolution from the original data can not only fail to teach the model the new task but also degrade its overall performance. Even when the model is additionally trained on high-quality images, it is difficult for individuals to run models on consumer hardware. For example, the training process for waifu-diffusion requires a minimum of 30 GB of VRAM, which exceeds the usual resource provided in consumer GPUs such as Nvidia's GeForce 30 series, which has around 12 GB.

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. As a result, generated images reinforce social biases and are from a Western perspective, as the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation.


End-user fine tuning

To address the limitations of the model's initial training, end-users may opt to implement additional training to fine-tune generation outputs to match more specific use-cases. There are three methods in which user-accessible fine-tuning can be applied to a Stable Diffusion model checkpoint:

* An "embedding" can be trained from a collection of user-provided images, and allows the model to generate visually similar images whenever the name of the embedding is used within a generation prompt. Embeddings are based on the "textual inversion" concept developed by researchers from Tel Aviv University in 2022 with support from Nvidia, where vector representations for specific tokens used by the model's text encoder are linked to new pseudo-words. Embeddings can be used to reduce biases within the original model, or mimic visual styles (a minimal loading example is shown after this list).
* A "hypernetwork" is a small pre-trained neural network that is applied to various points within a larger neural network, and refers to the technique created by NovelAI developer Kurumuz in 2021, originally intended for text-generation transformer models. Hypernetworks steer results towards a particular direction, allowing Stable Diffusion-based models to imitate the art style of specific artists, even if the artist is not recognised by the original model; they process the image by finding key areas of importance such as hair and eyes, and then patch these areas in secondary latent space.
* DreamBooth is a deep learning generation model developed by researchers from Google Research and Boston University in 2022 which can fine-tune the model to generate precise, personalised outputs that depict a specific subject, following training via a set of images which depict the subject.
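As a minimal sketch of the embedding approach, the following loads a base checkpoint and attaches a pre-trained textual-inversion embedding using the third-party Hugging Face diffusers library (one common way to run Stable Diffusion, not the original CompVis scripts); the embedding repository shown is an example concept from the public sd-concepts-library.

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a base Stable Diffusion checkpoint (one public distribution of the v1.5 weights).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Attach a textual-inversion embedding so its trigger token can be used in prompts.
    pipe.load_textual_inversion("sd-concepts-library/cat-toy")

    image = pipe("a photo of <cat-toy> on a beach").images[0]
    image.save("embedding_example.png")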


Capabilities

The Stable Diffusion model supports the ability to generate new images from scratch through the use of a text prompt describing elements to be included or omitted from the output. Existing images can be re-drawn by the model to incorporate new elements described by a text prompt (a process known as "guided image synthesis") through its diffusion-denoising mechanism. In addition, the model also allows the use of prompts to partially alter existing images via inpainting and outpainting, when used with an appropriate user interface that supports such features, of which numerous different open source implementations exist. Stable Diffusion is recommended to be run with 10 GB or more of VRAM; however, users with less VRAM may opt to load the weights in float16 precision instead of the default float32, trading off model performance for lower VRAM usage.
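The float16 option can be sketched as follows, assuming the third-party Hugging Face diffusers library; the model identifier is one public distribution of the v1.5 weights.

    import torch
    from diffusers import StableDiffusionPipeline

    # Loading in float16 roughly halves memory use compared with the default float32,
    # at a small cost in numerical precision.
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,   # instead of the default torch.float32
    ).to("cuda")

    image = pipe("a lighthouse at sunset, oil painting").images[0]
    image.save("lighthouse.png")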


Text to image generation

The text-to-image sampling script within Stable Diffusion, known as "txt2img", consumes a text prompt in addition to assorted option parameters covering sampling types, output image dimensions, and seed values. The script outputs an image file based on the model's interpretation of the prompt. Generated images are tagged with an invisible digital watermark to allow users to identify an image as generated by Stable Diffusion, although this watermark loses its efficacy if the image is resized or rotated.

Each txt2img generation involves a specific seed value which affects the output image. Users may opt to randomize the seed in order to explore different generated outputs, or use the same seed to obtain the same image output as a previously generated image. Users are also able to adjust the number of inference steps for the sampler; a higher value takes longer, while a smaller value may result in visual defects. Another configurable option, the classifier-free guidance scale value, allows the user to adjust how closely the output image adheres to the prompt. More experimental use cases may opt for a lower scale value, while use cases aiming for more specific outputs may use a higher value.

Additional txt2img features are provided by front-end implementations of Stable Diffusion, which allow users to modify the weight given to specific parts of the text prompt. Emphasis markers allow users to add or reduce emphasis on keywords by enclosing them in brackets. An alternative method of adjusting the weight of parts of the prompt is "negative prompts". Negative prompts are a feature included in some front-end implementations, including Stability AI's own DreamStudio cloud service, and allow the user to specify prompts which the model should avoid during image generation. The specified prompts may be undesirable image features that would otherwise be present within image outputs due to the positive prompts provided by the user, or due to how the model was originally trained, with mangled human hands being a common example.
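The same knobs (seed, step count, guidance scale, negative prompt) are exposed by the third-party Hugging Face diffusers library; the following sketch shows one way to set them, and is not the original txt2img script.

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

    generator = torch.Generator("cuda").manual_seed(1234)    # fixed seed, reproducible output
    image = pipe(
        prompt="a watercolor painting of a mountain village at dawn",
        negative_prompt="mangled hands, blurry, watermark",   # features to steer away from
        num_inference_steps=30,       # more steps: slower, usually fewer visual defects
        guidance_scale=7.5,           # classifier-free guidance; higher follows the prompt more closely
        height=512, width=512,
        generator=generator,
    ).images[0]
    image.save("village.png")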


Image modification

Stable Diffusion also includes another sampling script, "img2img", which consumes a text prompt, a path to an existing image, and a strength value between 0.0 and 1.0. The script outputs a new image based on the original image that also features elements provided within the text prompt. The strength value denotes the amount of noise added to the output image; a higher strength value produces more variation within the image, but may produce an image that is not semantically consistent with the prompt provided.

The ability of img2img to add noise to the original image makes it potentially useful for data anonymization and data augmentation, in which the visual features of image data are changed and anonymized. The same process may also be useful for image upscaling, in which the resolution of an image is increased, with more detail potentially being added to the image. Additionally, Stable Diffusion has been experimented with as a tool for image compression; compared to JPEG and WebP, the methods used for image compression in Stable Diffusion face limitations in preserving small text and faces.

Additional use-cases for image modification via img2img are offered by numerous front-end implementations of the Stable Diffusion model. Inpainting involves selectively modifying a portion of an existing image delineated by a user-provided layer mask, which fills the masked space with newly generated content based on the provided prompt. A dedicated model specifically fine-tuned for inpainting use-cases was created by Stability AI alongside the release of Stable Diffusion 2.0. Conversely, outpainting extends an image beyond its original dimensions, filling the previously empty space with content generated based on the provided prompt. A depth-guided model, named "depth2img", was introduced with the release of Stable Diffusion 2.0 on November 24, 2022; this model infers the depth of the provided input image and generates a new output image based on both the text prompt and the depth information, which allows the coherence and depth of the original input image to be maintained in the generated output.
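A minimal img2img call, again assuming the third-party Hugging Face diffusers library rather than the original script, looks roughly as follows; the input filename is a placeholder.

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

    init_image = Image.open("photo.png").convert("RGB").resize((512, 512))
    image = pipe(
        prompt="the same scene as a snowy winter evening",
        image=init_image,
        strength=0.6,          # 0.0 keeps the input nearly untouched, 1.0 largely ignores it
        guidance_scale=7.5,
    ).images[0]
    image.save("winter.png")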


Usage and controversy

Stable Diffusion claims no rights on generated images and freely gives users the rights of usage to any generated images from the model, provided that the image content is not illegal or harmful to individuals. The freedom provided to users over image usage has caused controversy over the ethics of ownership, as Stable Diffusion and other generative models are trained on copyrighted images without the owners' consent. As visual styles and compositions are not subject to copyright, it is often interpreted that users of Stable Diffusion who generate images of artworks should not be considered to be infringing upon the copyright of visually similar works. However, individuals depicted in generated images may be protected by personality rights if their likeness is used, and intellectual property such as recognizable brand logos still remains protected by copyright. Nonetheless, visual artists have expressed concern that widespread usage of image synthesis software such as Stable Diffusion may eventually lead to human artists, along with photographers, models, cinematographers, and actors, gradually losing commercial viability against AI-based competitors.

Stable Diffusion is notably more permissive in the types of content users may generate, such as violent or sexually explicit imagery, in comparison to other commercial products based on generative AI. Addressing the concerns that the model may be used for abusive purposes, CEO of Stability AI Emad Mostaque explains that "[it is] peoples' responsibility as to whether they are ethical, moral, and legal in how they operate this technology", and that putting the capabilities of Stable Diffusion into the hands of the public would result in the technology providing a net benefit, in spite of the potential negative consequences. In addition, Mostaque argues that the intention behind the open availability of Stable Diffusion is to end corporate control and dominance over such technologies by companies that have previously only developed closed AI systems for image synthesis. This is reflected by the fact that any restrictions Stability AI places on the content that users may generate can easily be bypassed due to the availability of the source code.


Litigation

In January 2023, three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, filed a copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies had infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists. The same month, Stability AI was also sued by Getty Images for using its images in the training data.


License

Unlike models such as DALL-E, Stable Diffusion makes its source code available, along with pretrained weights. Its license prohibits certain use cases, including crime, libel, harassment, doxing, "exploiting ... minors", giving medical advice, automatically creating legal obligations, producing legal evidence, and "discriminating against or harming individuals or groups based on ... social behavior or ... personal or personality characteristics ... [or] legally protected characteristics or categories". The user owns the rights to their generated output images, and is free to use them commercially.


See also

* 15.ai
* Artificial intelligence art
* Craiyon
* Imagen (Google Brain)




External links


Stable Diffusion Demo